Lecture 16: Ridge Regression and LASSO
University of British Columbia Okanagan
Recall the linear model:
\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \]
In ordinary least squares regression, estimates of the regression coefficients (i.e. the \(\beta\)s) are found by minimizing the residual sum of squares (RSS), which serves as our “cost” function:
\[RSS = \sum_{i=1}^n (y_i - \hat y_i)^2\]
We’ve now seen a couple of alternatives to the linear model for regression.
BUT, the linear model still reigns supreme in most realms of science due to its simplicity (which helps for inference).
We can improve upon the linear model (both in terms of prediction accuracy and model interpretability) by replacing ordinary least squares fitting with some alternative fitting procedures: shrinkage (aka regularization) methods.
Prediction Accuracy
Model Interpretability
Think about fitting an OLS line to a single observation, or to just two data points: with so few observations the fitted line is driven almost entirely by noise.
If \(n\), the number of training observations, is not much larger than \(p\), the number of predictors, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.
Instead of least squares, we will consider an alternative fitting procedure that will introduce a small amount of bias to get back a significant drop in variance (thereby improving prediction accuracy).
To combat this, we could use subset selection, wherein only a subset of the \(p\) predictors that we believe to be related to the response are used to fit the model (see Ch 6.1 in ISLR).
Using cross-validated prediction error, for example, we could select a single best model from all possible models.
For large \(p\), however, an exhaustive search of every possible subset of predictors is not computationally feasible; furthermore, it is liable to overfit the data.
Ridge regression (RR) includes a penalty in the estimation process and aims to minimize:
\[\begin{gather} RSS + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^n ( y_i - \hat y_i)^2 + \lambda \sum_{j=1}^p \beta_j^2\\ = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2+ \lambda \sum_{j=1}^p \beta_j^2 \end{gather}\] where \(\lambda \geq 0\) is a tuning parameter that shrinks the \(\beta_j\) estimates towards 0.
As \(\lambda\) increases the coefficient estimates will tend to “shrink” towards 0.
Hence minimizing \(RSS + \lambda \sum_{j=1}^p \beta_j^2\) will encourage smaller (in magnitude) values for \(\beta_j\)
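To make the cost function concrete, here is a minimal R sketch on a small made-up data set (not one used in the course) that evaluates \(RSS + \lambda \sum_j \beta_j^2\) for a given set of coefficients:

# Toy illustration (hypothetical data): evaluate the ridge cost for given coefficients
set.seed(1)
x <- matrix(rnorm(20 * 3), nrow = 20)        # n = 20 observations, p = 3 predictors
y <- 1 + 2 * x[, 1] - x[, 3] + rnorm(20)     # x[, 2] is an irrelevant predictor

ridge_cost <- function(beta0, beta, x, y, lambda) {
  rss <- sum((y - beta0 - x %*% beta)^2)     # residual sum of squares
  rss + lambda * sum(beta^2)                 # note: the intercept is not penalized
}

b <- coef(lm(y ~ x))                         # ordinary least squares estimates
ridge_cost(b[1], b[-1], x, y, lambda = 0)    # lambda = 0 recovers the usual RSS
ridge_cost(b[1], b[-1], x, y, lambda = 10)   # larger lambda adds the shrinkage penalty

The OLS estimates minimize the first quantity; for \(\lambda > 0\), a coefficient vector smaller in magnitude would minimize the penalized cost.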
Credit

The Credit dataset contains information about credit card holders and includes 11 variables, including Balance (the average credit card debt for each individual, which is the response) and several quantitative predictors: Age, Cards (number of credit cards), Education (years of education), Income (in thousands of dollars), Limit (credit limit), and Rating (credit rating).

Fitting a least squares regression model to this data, we get…
Credit fit...
Estimate Std. Error t value Pr(>|t|)
(Intercept) -479.20787 35.77394 -13.395 < 2e-16 ***
Income -7.80310 0.23423 -33.314 < 2e-16 ***
Limit 0.19091 0.03278 5.824 1.21e-08 ***
Rating 1.13653 0.49089 2.315 0.0211 *
Cards 17.72448 4.34103 4.083 5.40e-05 ***
Age -0.61391 0.29399 -2.088 0.0374 *
Education -1.09886 1.59795 -0.688 0.4921
OwnYes -10.65325 9.91400 -1.075 0.2832
StudentYes 425.74736 16.72258 25.459 < 2e-16 ***
MarriedYes -8.53390 10.36287 -0.824 0.4107
RegionSouth 10.10703 12.20992 0.828 0.4083
RegionWest 16.80418 14.11906 1.190 0.2347
---
...
With an Adjusted \(R^2\) of 0.9538.
Unlike least squares, ridge regression will produce a different set of coefficient estimates, \(\beta^R_\lambda\), for each value of \(\lambda\)
ISLR Fig 6.4: The standardized ridge regression coefficients are displayed for the Credit data set, as a function of \(\lambda\) and the standardized \(\ell_2\) norm.
Another way we could have presented the penalty is through the norms. The p-norm (or \(\ell_p\) norm) is defined as:
\[ || x ||_p = \left(\sum_i |x_i|^p\right)^{1/p} \]
The 2-norm is Euclidean distance, \(|| x ||_2 = \sqrt{\sum_i x_i^2} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\), and the Manhattan distance is the 1-norm (aka L1 norm): \(|| x ||_1 = \sum_i |x_i| = |x_1| + |x_2| + \ldots + |x_n|\)
The squared L2 norm is used in the ridge regression penalty:
\[ \lambda || \beta ||_2^2 = \lambda \left( \sqrt{ \beta_1^2 + \beta_2^2 + \ldots + \beta_p^2 } \right)^2 = \lambda\sum_{j=1}^p \beta_j^2 \]
This process of adding a norm to our cost function is known as regularization.
Hence ridge regression is also known as L2 regularization.
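As a quick numerical check of these definitions, the norms are easy to compute directly in R (the vector here is arbitrary, just for illustration):

x <- c(3, -4, 12)
sum(abs(x))      # 1-norm (Manhattan): |3| + |-4| + |12| = 19
sqrt(sum(x^2))   # 2-norm (Euclidean): sqrt(9 + 16 + 144) = 13
sum(x^2)         # squared 2-norm, the quantity appearing in the ridge penalty: 169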
Clearly our choice for \(\lambda\) will be an important one (can you guess how we will find it?)
If \(\lambda=0\) then the estimation process is simply least squares.
As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
As \(\lambda \rightarrow \infty\), the penalty grows and the coefficients approach (but never equal) 0; this corresponds to the null model that contains no predictors.
ISLR Fig 6.5: Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set containing \(p = 45\) predictors and \(n = 50\) observations, as a function of \(\lambda\) (left) and the standardized \(\ell_2\) norm (right). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the models for which the MSE is smallest.
Least squares coefficients are scale equivariant: multiplying a predictor \(X_j\) by a constant \(c\) simply scales the least squares estimate by a factor of \(1/c\), so \(X_j \hat\beta_j\) (and hence the fit) is unchanged.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, so it is best to standardize the predictors before applying ridge regression.
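A small simulated example (hypothetical data, not from the course) makes the least squares half of this claim easy to verify in R:

set.seed(1)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

coef(lm(y ~ x))            # slope on the original scale of x
coef(lm(y ~ I(100 * x)))   # slope is exactly 1/100 as large, so X * beta_hat is unchanged

With ridge regression the penalty \(\lambda \sum_j \beta_j^2\) depends on the units of each predictor, so no such clean rescaling holds; this is why predictors are typically standardized first.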
The least absolute shrinkage and selection operator or LASSO is arguably the most popular modern method applied to linear models.
It is quite similar to ridge regression only now it minimizes \[\begin{align*} RSS + \lambda \sum_{j=1}^p |\beta_j| \end{align*}\]
Important advantage: it will force some coefficients to 0 (whereas RR will only force them “close” to 0)
Another way that you might see this model being presented is using the L1 norm; in other words, the LASSO uses the \(\ell_1\) penalty:
\[ \begin{align} \lambda || \beta ||_1 &= \lambda\left( | \beta_1 | + | \beta_2 | + \ldots + | \beta_p | \right)\\ &= \lambda\sum_{j=1}^p | \beta_j | \end{align} \]
Hence this method is also referred to as L1 regularization.
As with ridge regression, the lasso produces a different set of coefficient estimates, \(\beta^L_\lambda\), for each value of \(\lambda\).
ISLR Fig 6.6: The standardized lasso coefficients on the Credit data set are shown as a function of \(\lambda\) and the standardized \(\ell_1\) norm.
It is worth repeating that the \(\ell_1\) penalty forces some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large
Hence the lasso performs variable selection
As a result, models generated from the lasso are generally much easier to interpret than those of ridge regression
We say that the lasso yields “sparse” models, that is, models that involve only a subset of the variables.
The lasso implicitly assumes that a number of the coefficients truly equal zero
When this is not the case, i.e. when all predictors are useful, ridge regression will lead to better prediction accuracy.
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
The choice between ridge regression and the LASSO can also boil down to the prediction vs. inference trade-off (aka the accuracy-interpretability trade-off).
One might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero
Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.
Since we do not typically know this, a technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
These methods are most suitable when \(p > n\).
Even when \(p < n\), these approaches (especially the lasso) are popular with large data sets for fitting a linear regression model and performing variable selection simultaneously.
Being able to find so-called “sparse” (as opposed to “dense”) models that select a small number of predictive variables in high-dimensional datasets is tremendously useful.
Less obviously, when you have multicollinearity in your data, the least squares estimates have high variance.
Furthermore, multicollinearity makes it difficult to distinguish which variables (if any) are truly useful.
Using the same argument as before, we can use ridge regression and the lasso to reduce the variance and improve prediction accuracy while simultaneously shrinking the coefficients of redundant predictors.
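A small simulation sketch (made-up data) of the first point: when two predictors are nearly copies of each other, the individual least squares coefficients vary wildly from sample to sample.

# Highly correlated predictors inflate the variance of the OLS coefficient estimates
set.seed(1)
slopes <- replicate(200, {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.01)   # x2 is nearly identical to x1
  y  <- x1 + x2 + rnorm(50)
  coef(lm(y ~ x1 + x2))["x1"]       # estimate of the coefficient on x1 (true value is 1)
})
sd(slopes)                          # very large sampling variability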
Credit
Both the lasso and ridge regression require specification of \(\lambda\)
In fact, for each value of \(\lambda\) we will see different coefficients.
So how do we find the best one?
The answer, as you may have guessed, is cross-validation.
glmnet

To perform regression with regularization we use the glmnet function (from the package glmnet). Note that glmnet does not allow the formula argument, so we will need to specify x and y. N.B. there is a handy helper function, model.matrix(), that will help us do this.
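A minimal sketch of this workflow for the Credit data (assuming the data set has been loaded, e.g. from the ISLR2 package):

library(glmnet)
library(ISLR2)   # assumed source of the Credit data set

# glmnet wants a numeric predictor matrix x and a response vector y
x <- model.matrix(Balance ~ ., data = Credit)[, -1]   # drop the intercept column
y <- Credit$Balance

rr <- glmnet(x, y, alpha = 0)   # alpha = 0 requests ridge regression (see below)
dim(coef(rr))                   # one column of coefficient estimates per lambda value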
glmnet penalty

The penalty used is:
\[ \frac{(1-\alpha)}{2} || \beta ||_2^2 + \alpha ||\beta||_1 \]
alpha = 1 is the lasso penalty \(||\beta||_1\) (default)
alpha = 0 is the ridge penalty
Values between 0 and 1 are allowed and correspond to the elastic net (this model is not discussed here or in the book).
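Continuing the sketch above (same x and y), switching between ridge and the lasso is just a change of alpha:

rr    <- glmnet(x, y, alpha = 0)   # ridge: (squared) L2 penalty
lasso <- glmnet(x, y, alpha = 1)   # lasso: L1 penalty (the default)

# The lasso path sets some coefficients exactly to zero; the ridge path does not
plot(rr,    xvar = "lambda", label = TRUE)
plot(lasso, xvar = "lambda", label = TRUE)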
glmnet \(\lambda\)

A grid of values for the regularization parameter \(\lambda\) is considered.
nlambda is the number of \(\lambda\) values considered in the grid (the default is 100).
The actual values are determined by nlambda and lambda.min.ratio (see the help file for more on this).
Alternatively, the user may supply a sequence for \(\lambda\) (in the optional lambda argument), but this is not recommended.
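For example (continuing with the lasso fit above), the grid glmnet actually used can be inspected, and a coarser grid can be requested via nlambda:

length(lasso$lambda)      # up to nlambda values (default 100); the path may stop early
head(log(lasso$lambda))   # the grid is equally spaced on the log scale
lasso20 <- glmnet(x, y, alpha = 1, nlambda = 20)   # request a coarser grid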
glmnet standardizing and family

By default the function standardizes the data.
In order to determine the best choice for \(\lambda\) we use cross-validation.
The cv.glmnet function from the same package will perform \(k\)-fold cross-validation for glmnet (the default is 10-fold).
From that we produce a plot that returns good value(s) for \(\lambda\).
The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the predictors, but it also automatically transforms any qualitative variables into dummy variables.
Credit

To see the values for the coefficient estimates use coef(). For example, to see the coefficients produced when \(\lambda\) is the 60th value in our sequence ( log \(\lambda\) = 0.4938434 in this case):
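The original code chunk is not reproduced here; a minimal sketch, assuming the ridge fit from the earlier sketch is stored as rr:

log(rr$lambda[60])            # log-lambda for the 60th grid value (exact value depends on the fit)
coef(rr)[, 60]                # coefficient estimates at that lambda
coef(rr, s = rr$lambda[60])   # equivalent, using the s argument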
Credit
set.seed(10) # use cross-validation to choose the tuning parameter
cv.out.rr <- cv.glmnet(x, y, alpha = 0)
plot(cv.out.rr); log(bestlam <- cv.out.rr$lambda.min)
[1] 3.680249
set.seed(1) # use cross-validation to choose the tuning parameter
cv.out.lasso <- cv.glmnet(x, y, alpha = 1)
plot(cv.out.lasso); log(bestlam <- cv.out.lasso$lambda.min)
[1] -0.5295277
The cv.glmnet output plot produced from a simulation that you will see in lab.
Beer
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7174521 0.6807382 5.461 8.56e-07 ***
qlty -0.0111435 0.0064446 -1.729 0.0887 .
cal -0.0055034 0.0089957 -0.612 0.5429
alc 0.0764520 0.1752318 0.436 0.6641
bitter 0.0677117 0.0153219 4.419 3.98e-05 ***
malty 0.0002832 0.0099639 0.028 0.9774
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8661 on 63 degrees of freedom
Multiple R-squared: 0.6678, Adjusted R-squared: 0.6415
F-statistic: 25.33 on 5 and 63 DF, p-value: 6.588e-14
...
cbind(coef(beer.lm), # OLS (least squares estimates)
# LASSO estimates ...
coef(lasso, s = bestlam.lasso), # ... at best lambda
coef(lasso, s = lambda2)) # ... at lambda within 1 s.e.
6 x 3 sparse Matrix of class "dgCMatrix"
s1 s1
(Intercept) 3.7174521392 3.334440257 3.25063467
qlty -0.0111434572 -0.008782345 .
cal -0.0055034319 . .
alc 0.0764519623 . .
bitter 0.0677117274 0.062008057 0.04831986
malty 0.0002831604 . .
Ridge regression and the LASSO work best in situations where the least squares estimates have high variance.
This can happen when \(p\) exceeds (or is close to) \(n\).
It can also happen when we have multicollinearity
The LASSO shrinks some coefficient estimates to exactly 0, hence it can be viewed as a variable selection technique.
Ridge regression only shrinks coefficient estimates close to 0, so it never actually eliminates any predictors.
The coefficient estimates can be reformulated as the solutions to the following two constrained optimization problems.
The lasso solves:
\[ \text{minimize} \quad \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s \]
Ridge regression solves:
\[ \text{minimize} \quad \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq s \]
Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, \(|\beta_1| + |\beta_2| \leq s\) and \(\beta_1^2 + \beta_2^2 \leq s\), while the red ellipses are the contours of the RSS.